Das Ziel ist es, aus dem Datacamp Datensatz [Video Game Sales Date] (https://app.datacamp.com/workspace/datasets/dataset-python-video-games-sales), , folgende Fragestellung / Hypothese zu beantworten:

Fragestellung: “Ist es wahrscheinlicher, dass sich bestimmte Spielgenres in Europa signifikant (Unterschied von 50%) besser verkaufen lassen als im Japanischen und Nordamerikanischen Markt?”

Als Einführung werden wir auf Datacamp folgende Kurse durchgehen:

# Import libraries
library("plotly")
library("ggplot2")
library("plyr")
library("dplyr")
# Read csv from folder "data"
df = read.csv("./data/video_games_data.csv")

head(df, 10)

Data Wrangling

Bevor wir mit den Visualisierungen und Modelle beginnen können, müssen wir die Daten säubern. Das heisst es sollte keine Duplikate geben, fehlende Werte sollten korrekt eingetragen werden und Daten, die nicht verwendet werden sollten gelöscht werden.

Es hat “N/A” Werte in den Spalten “Year” und “Publisher”. Diese Werte sollten korrekte “NA” Werte sein, damit sie bei den Visualisierungen und Berechnungen nicht berücksichtigt werden.

# Show rows with "N/A" values
df[grep("N/A", df$Publisher),]
df[grep("N/A", df$Year),]
# Replace "N/A" with "NA"
df[df == "N/A"] <- NA
# Check if values have been converted
df %>% 
  summarize(across(everything(), ~sum(is.na(.))))

Da 2017 nur 3 Einträge und 2020 nur 1 Eintrag beinhaltet, werden wir diese Jahren nicht berücksichtigen und aus dem Dataframe löschen, weil sie nicht vollständig sind und so könnten unsere Modelle ungenau werden.

# Remove years 2017 and 2020 from dataset
df_clean <- df[!(df$Year == "2017" | df$Year == "2020"),]
View(df_clean)
# Remove columns: Rank, Game name, Platform, Global Sales because they are not needed for our thesis
# Set data to correct type
df_clean$Genre <- as.factor(df_clean$Genre)
df_clean$Year <- as.numeric(df_clean$Year)

Erste Plots erstellen

na <- sum(df_clean[, 'NA_Sales'], na.rm = TRUE)
eu <- sum(df_clean[, 'EU_Sales'], na.rm = TRUE)
jp <- sum(df_clean[, 'JP_Sales'], na.rm = TRUE)
o <- sum(df_clean[, 'Other_Sales'], na.rm = TRUE)
g <- sum(df_clean[, 'Global_Sales'], na.rm = TRUE)

fig <- plot_ly(
  y = c(na, eu, jp, o), 
  x = c("North America", "Europe", "Japan", "Other"), 
  type = 'bar',
  width = 800
)

fig <- fig %>% layout(title = "Video Game Sales Overview",
         xaxis = list(title = "Region"),
         yaxis = list(title = "Sales (million)"))

fig
# group by genre and summarize game sales to each region
df_genre <- df_clean %>%
  group_by(Genre) %>%
  summarize(
    NA_Sales_Sum = sum(NA_Sales),
    EU_Sales_Sum = sum(EU_Sales), 
    JP_Sales_Sum = sum(JP_Sales),
    Other_Sales_Sum = sum(Other_Sales),
    Global_Sales_Sum = sum(Global_Sales)
  )

df_genre

fig <- plot_ly(
  df_genre, y = ~Genre, x = ~NA_Sales_Sum, type = "bar", name = "North America", width = 1000, height = 800) %>% 
  add_trace(x = ~EU_Sales_Sum, name = "Europe") %>%
  add_trace(x = ~JP_Sales_Sum, name = "Japan") %>%
  add_trace(x = ~Other_Sales_Sum, name = "Other") %>%
  layout(
    title = "Video Game Sales by Genre",
    xaxis = list(title = "Sales (million)"),
    barmode = "group"
  )

fig
Warning: Ignoring 1 observations
Warning: Ignoring 1 observations
Warning: Ignoring 1 observations
Warning: Ignoring 1 observations
Warning: Ignoring 1 observations
Warning: Ignoring 1 observations
Warning: Ignoring 1 observations
Warning: Ignoring 1 observations
df_year <- df_clean %>%
  group_by(Year) %>%
  summarize(
    NA_Sales_Sum = sum(NA_Sales),
    EU_Sales_Sum = sum(EU_Sales), 
    JP_Sales_Sum = sum(JP_Sales),
    Other_Sales_Sum = sum(Other_Sales),
    Global_Sales_Sum = sum(Global_Sales)
  )

fig <- plot_ly(
  df_year, y = ~NA_Sales_Sum, x = ~Year, type = "bar", name = "North America", width = 900, height = 500) %>% 
  add_trace(y = ~EU_Sales_Sum, name = "Europe") %>%
  add_trace(y = ~JP_Sales_Sum, name = "Japan") %>%
  add_trace(y = ~Other_Sales_Sum, name = "Other") %>%
  layout(
    title = "Video Game from Sales by Year",
    xaxis = list(title = "Year"),
    yaxis = list(title = "Sales (million)"),
    barmode = "stack"
  )

fig
Warning: Ignoring 1 observations
Warning: Ignoring 1 observations
Warning: Ignoring 1 observations
Warning: Ignoring 1 observations
Warning: Ignoring 1 observations
Warning: Ignoring 1 observations
Warning: Ignoring 1 observations
Warning: Ignoring 1 observations

Regressionsmodelle

Da wir unsere Daten jetzt besser verstehen, können wir mit den Regressionsmodellen und mit der Beantwortung unserer Fragestellung beginnen.

# calculate the average of sales of each genre from each region
df_sales_avg <- df_clean %>%
  group_by(Genre) %>%
  summarise(
    EU_Sales_Avg = mean(EU_Sales),
    NA_Sales_Avg = mean(NA_Sales),
    JP_Sales_Avg = mean(JP_Sales),
    Other_Sales_Avg = mean(Other_Sales),
    Global_Sales_Avg = mean(Global_Sales))

df_sales_avg
# Normalise data? -> NA obviously has more sales (biggest market)
LS0tDQp0aXRsZTogIlJlZ3Jlc3Npb24gbW9kZWxzIHdpdGggUiINCm91dHB1dDogaHRtbF9ub3RlYm9vaw0KLS0tDQoNCkRhcyBaaWVsIGlzdCBlcywgYXVzIGRlbSBEYXRhY2FtcCBEYXRlbnNhdHogW1ZpZGVvIEdhbWUgU2FsZXMgRGF0ZV0gKGh0dHBzOi8vYXBwLmRhdGFjYW1wLmNvbS93b3Jrc3BhY2UvZGF0YXNldHMvZGF0YXNldC1weXRob24tdmlkZW8tZ2FtZXMtc2FsZXMpLCAsIGZvbGdlbmRlIEZyYWdlc3RlbGx1bmcgLyBIeXBvdGhlc2UgenUgYmVhbnR3b3J0ZW46DQoNCg0KIyMjIEZyYWdlc3RlbGx1bmc6ICJJc3QgZXMgd2FocnNjaGVpbmxpY2hlciwgZGFzcyBzaWNoIGJlc3RpbW10ZSBTcGllbGdlbnJlcyBpbiBFdXJvcGEgc2lnbmlmaWthbnQgKFVudGVyc2NoaWVkIHZvbiA1MCUpIGJlc3NlciB2ZXJrYXVmZW4gbGFzc2VuIGFscyBpbSBKYXBhbmlzY2hlbiB1bmQgTm9yZGFtZXJpa2FuaXNjaGVuIE1hcmt0PyINCg0KDQpBbHMgRWluZsO8aHJ1bmcgd2VyZGVuIHdpciBhdWYgRGF0YWNhbXAgZm9sZ2VuZGUgS3Vyc2UgZHVyY2hnZWhlbjoNCg0KLSBbSW50cm9kdWN0aW9uIHRvIFJlZ3Jlc3Npb24gaW4gUl0oaHR0cHM6Ly9hcHAuZGF0YWNhbXAuY29tL2xlYXJuL2NvdXJzZXMvaW50cm9kdWN0aW9uLXRvLXJlZ3Jlc3Npb24taW4tcikNCg0KLSBbSW50ZXJtZWRpYXRlIFJlZ3Jlc3Npb24gaW4gUl0oaHR0cHM6Ly9hcHAuZGF0YWNhbXAuY29tL2xlYXJuL2NvdXJzZXMvaW50ZXJtZWRpYXRlLXJlZ3Jlc3Npb24taW4tcikNCg0KDQpgYGB7cn0NCiMgSW1wb3J0IGxpYnJhcmllcw0KbGlicmFyeSgicGxvdGx5IikNCmxpYnJhcnkoImdncGxvdDIiKQ0KbGlicmFyeSgicGx5ciIpDQpsaWJyYXJ5KCJkcGx5ciIpDQpgYGANCg0KYGBge3J9DQojIFJlYWQgY3N2IGZyb20gZm9sZGVyICJkYXRhIg0KZGYgPSByZWFkLmNzdigiLi9kYXRhL3ZpZGVvX2dhbWVzX2RhdGEuY3N2IikNCg0KaGVhZChkZiwgMTApDQpgYGANCiMjIyBEYXRhIFdyYW5nbGluZw0KQmV2b3Igd2lyIG1pdCBkZW4gVmlzdWFsaXNpZXJ1bmdlbiB1bmQgTW9kZWxsZSBiZWdpbm5lbiBrw7ZubmVuLCBtw7xzc2VuIHdpciBkaWUgRGF0ZW4gc8OkdWJlcm4uIERhcyBoZWlzc3QgZXMgc29sbHRlIGtlaW5lIER1cGxpa2F0ZSBnZWJlbiwgZmVobGVuZGUgV2VydGUgc29sbHRlbiBrb3JyZWt0IGVpbmdldHJhZ2VuIHdlcmRlbiB1bmQgRGF0ZW4sIGRpZSBuaWNodCB2ZXJ3ZW5kZXQgd2VyZGVuIHNvbGx0ZW4gZ2Vsw7ZzY2h0IHdlcmRlbi4NCg0KRXMgaGF0ICJOL0EiIFdlcnRlIGluIGRlbiBTcGFsdGVuICJZZWFyIiB1bmQgIlB1Ymxpc2hlciIuIERpZXNlIFdlcnRlIHNvbGx0ZW4ga29ycmVrdGUgIk5BIiBXZXJ0ZSBzZWluLCBkYW1pdCBzaWUgYmVpIGRlbiBWaXN1YWxpc2llcnVuZ2VuIHVuZCBCZXJlY2hudW5nZW4gbmljaHQgYmVyw7xja3NpY2h0aWd0IHdlcmRlbi4NCg0KYGBge3J9DQojIFNob3cgcm93cyB3aXRoICJOL0EiIHZhbHVlcw0KZGZbZ3JlcCgiTi9BIiwgZGYkUHVibGlzaGVyKSxdDQpkZltncmVwKCJOL0EiLCBkZiRZZWFyKSxdDQpgYGANCg0KYGBge3J9DQojIFJlcGxhY2UgIk4vQSIgd2l0aCAiTkEiDQpkZltkZiA9PSAiTi9BIl0gPC0gTkENCmBgYA0KDQpgYGB7cn0NCiMgQ2hlY2sgaWYgdmFsdWVzIGhhdmUgYmVlbiBjb252ZXJ0ZWQNCmRmICU+JSANCiAgc3VtbWFyaXplKGFjcm9zcyhldmVyeXRoaW5nKCksIH5zdW0oaXMubmEoLikpKSkNCmBgYA0KRGEgMjAxNyBudXIgMyBFaW50csOkZ2UgdW5kIDIwMjAgbnVyIDEgRWludHJhZyBiZWluaGFsdGV0LCB3ZXJkZW4gd2lyIGRpZXNlIEphaHJlbiBuaWNodCBiZXLDvGNrc2ljaHRpZ2VuIHVuZCBhdXMgZGVtIERhdGFmcmFtZSBsw7ZzY2hlbiwgd2VpbCBzaWUgbmljaHQgdm9sbHN0w6RuZGlnIHNpbmQgdW5kIHNvIGvDtm5udGVuIHVuc2VyZSBNb2RlbGxlIHVuZ2VuYXUgd2VyZGVuLg0KDQpgYGB7cn0NCiMgUmVtb3ZlIHllYXJzIDIwMTcgYW5kIDIwMjAgZnJvbSBkYXRhc2V0DQpkZl9jbGVhbiA8LSBkZlshKGRmJFllYXIgPT0gIjIwMTciIHwgZGYkWWVhciA9PSAiMjAyMCIpLF0NClZpZXcoZGZfY2xlYW4pDQpgYGANCg0KYGBge3J9DQojIFJlbW92ZSBjb2x1bW5zOiBSYW5rLCBHYW1lIG5hbWUsIFBsYXRmb3JtLCBHbG9iYWwgU2FsZXMgYmVjYXVzZSB0aGV5IGFyZSBub3QgbmVlZGVkIGZvciBvdXIgdGhlc2lzDQpgYGANCg0KYGBge3J9DQojIFNldCBkYXRhIHRvIGNvcnJlY3QgdHlwZQ0KZGZfY2xlYW4kR2VucmUgPC0gYXMuZmFjdG9yKGRmX2NsZWFuJEdlbnJlKQ0KZGZfY2xlYW4kWWVhciA8LSBhcy5udW1lcmljKGRmX2NsZWFuJFllYXIpDQpgYGANCg0KIyMjIEVyc3RlIFBsb3RzIGVyc3RlbGxlbg0KDQpgYGB7cn0NCm5hIDwtIHN1bShkZl9jbGVhblssICdOQV9TYWxlcyddLCBuYS5ybSA9IFRSVUUpDQpldSA8LSBzdW0oZGZfY2xlYW5bLCAnRVVfU2FsZXMnXSwgbmEucm0gPSBUUlVFKQ0KanAgPC0gc3VtKGRmX2NsZWFuWywgJ0pQX1NhbGVzJ10sIG5hLnJtID0gVFJVRSkNCm8gPC0gc3VtKGRmX2NsZWFuWywgJ090aGVyX1NhbGVzJ10sIG5hLnJtID0gVFJVRSkNCmcgPC0gc3VtKGRmX2NsZWFuWywgJ0dsb2JhbF9TYWxlcyddLCBuYS5ybSA9IFRSVUUpDQoNCmZpZyA8LSBwbG90X2x5KA0KICB5ID0gYyhuYSwgZXUsIGpwLCBvKSwgDQogIHggPSBjKCJOb3J0aCBBbWVyaWNhIiwgIkV1cm9wZSIsICJKYXBhbiIsICJPdGhlciIpLCANCiAgdHlwZSA9ICdiYXInLA0KICB3aWR0aCA9IDgwMA0KKQ0KDQpmaWcgPC0gZmlnICU+JSBsYXlvdXQodGl0bGUgPSAiVmlkZW8gR2FtZSBTYWxlcyBPdmVydmlldyIsDQogICAgICAgICB4YXhpcyA9IGxpc3QodGl0bGUgPSAiUmVnaW9uIiksDQogICAgICAgICB5YXhpcyA9IGxpc3QodGl0bGUgPSAiU2FsZXMgKG1pbGxpb24pIikpDQoNCmZpZw0KYGBgDQoNCmBgYHtyfQ0KIyBncm91cCBieSBnZW5yZSBhbmQgc3VtbWFyaXplIGdhbWUgc2FsZXMgdG8gZWFjaCByZWdpb24NCmRmX2dlbnJlIDwtIGRmX2NsZWFuICU+JQ0KICBncm91cF9ieShHZW5yZSkgJT4lDQogIHN1bW1hcml6ZSgNCiAgICBOQV9TYWxlc19TdW0gPSBzdW0oTkFfU2FsZXMpLA0KICAgIEVVX1NhbGVzX1N1bSA9IHN1bShFVV9TYWxlcyksIA0KICAgIEpQX1NhbGVzX1N1bSA9IHN1bShKUF9TYWxlcyksDQogICAgT3RoZXJfU2FsZXNfU3VtID0gc3VtKE90aGVyX1NhbGVzKSwNCiAgICBHbG9iYWxfU2FsZXNfU3VtID0gc3VtKEdsb2JhbF9TYWxlcykNCiAgKQ0KDQpkZl9nZW5yZQ0KDQpmaWcgPC0gcGxvdF9seSgNCiAgZGZfZ2VucmUsIHkgPSB+R2VucmUsIHggPSB+TkFfU2FsZXNfU3VtLCB0eXBlID0gImJhciIsIG5hbWUgPSAiTm9ydGggQW1lcmljYSIsIHdpZHRoID0gMTAwMCwgaGVpZ2h0ID0gODAwKSAlPiUgDQogIGFkZF90cmFjZSh4ID0gfkVVX1NhbGVzX1N1bSwgbmFtZSA9ICJFdXJvcGUiKSAlPiUNCiAgYWRkX3RyYWNlKHggPSB+SlBfU2FsZXNfU3VtLCBuYW1lID0gIkphcGFuIikgJT4lDQogIGFkZF90cmFjZSh4ID0gfk90aGVyX1NhbGVzX1N1bSwgbmFtZSA9ICJPdGhlciIpICU+JQ0KICBsYXlvdXQoDQogICAgdGl0bGUgPSAiVmlkZW8gR2FtZSBTYWxlcyBieSBHZW5yZSIsDQogICAgeGF4aXMgPSBsaXN0KHRpdGxlID0gIlNhbGVzIChtaWxsaW9uKSIpLA0KICAgIGJhcm1vZGUgPSAiZ3JvdXAiDQogICkNCg0KZmlnDQpgYGANCg0KYGBge3J9DQpkZl95ZWFyIDwtIGRmX2NsZWFuICU+JQ0KICBncm91cF9ieShZZWFyKSAlPiUNCiAgc3VtbWFyaXplKA0KICAgIE5BX1NhbGVzX1N1bSA9IHN1bShOQV9TYWxlcyksDQogICAgRVVfU2FsZXNfU3VtID0gc3VtKEVVX1NhbGVzKSwgDQogICAgSlBfU2FsZXNfU3VtID0gc3VtKEpQX1NhbGVzKSwNCiAgICBPdGhlcl9TYWxlc19TdW0gPSBzdW0oT3RoZXJfU2FsZXMpLA0KICAgIEdsb2JhbF9TYWxlc19TdW0gPSBzdW0oR2xvYmFsX1NhbGVzKQ0KICApDQoNCmZpZyA8LSBwbG90X2x5KA0KICBkZl95ZWFyLCB5ID0gfk5BX1NhbGVzX1N1bSwgeCA9IH5ZZWFyLCB0eXBlID0gImJhciIsIG5hbWUgPSAiTm9ydGggQW1lcmljYSIsIHdpZHRoID0gOTAwLCBoZWlnaHQgPSA1MDApICU+JSANCiAgYWRkX3RyYWNlKHkgPSB+RVVfU2FsZXNfU3VtLCBuYW1lID0gIkV1cm9wZSIpICU+JQ0KICBhZGRfdHJhY2UoeSA9IH5KUF9TYWxlc19TdW0sIG5hbWUgPSAiSmFwYW4iKSAlPiUNCiAgYWRkX3RyYWNlKHkgPSB+T3RoZXJfU2FsZXNfU3VtLCBuYW1lID0gIk90aGVyIikgJT4lDQogIGxheW91dCgNCiAgICB0aXRsZSA9ICJWaWRlbyBHYW1lIGZyb20gU2FsZXMgYnkgWWVhciIsDQogICAgeGF4aXMgPSBsaXN0KHRpdGxlID0gIlllYXIiKSwNCiAgICB5YXhpcyA9IGxpc3QodGl0bGUgPSAiU2FsZXMgKG1pbGxpb24pIiksDQogICAgYmFybW9kZSA9ICJzdGFjayINCiAgKQ0KDQpmaWcNCmBgYA0KDQojIyMgUmVncmVzc2lvbnNtb2RlbGxlDQoNCkRhIHdpciB1bnNlcmUgRGF0ZW4gamV0enQgYmVzc2VyIHZlcnN0ZWhlbiwga8O2bm5lbiB3aXIgbWl0IGRlbiBSZWdyZXNzaW9uc21vZGVsbGVuIHVuZCBtaXQgZGVyIEJlYW50d29ydHVuZyB1bnNlcmVyIEZyYWdlc3RlbGx1bmcgYmVnaW5uZW4uDQoNCmBgYHtyfQ0KIyBjYWxjdWxhdGUgdGhlIGF2ZXJhZ2Ugb2Ygc2FsZXMgb2YgZWFjaCBnZW5yZSBmcm9tIGVhY2ggcmVnaW9uDQpkZl9zYWxlc19hdmcgPC0gZGZfY2xlYW4gJT4lDQogIGdyb3VwX2J5KEdlbnJlKSAlPiUNCiAgc3VtbWFyaXNlKA0KICAgIEVVX1NhbGVzX0F2ZyA9IG1lYW4oRVVfU2FsZXMpLA0KICAgIE5BX1NhbGVzX0F2ZyA9IG1lYW4oTkFfU2FsZXMpLA0KICAgIEpQX1NhbGVzX0F2ZyA9IG1lYW4oSlBfU2FsZXMpLA0KICAgIE90aGVyX1NhbGVzX0F2ZyA9IG1lYW4oT3RoZXJfU2FsZXMpLA0KICAgIEdsb2JhbF9TYWxlc19BdmcgPSBtZWFuKEdsb2JhbF9TYWxlcykpDQoNCmRmX3NhbGVzX2F2Zw0KYGBgDQpgYGB7cn0NCiMgTm9ybWFsaXplIGRhdGE/IC0+IE5BIG9idmlvdXNseSBoYXMgbW9yZSBzYWxlcyAoYmlnZ2VzdCBtYXJrZXQpDQpgYGANCg0KDQoNCg==